DESCRIPTION

Method for Time Aligning Audio Signals Using Characterizations
Based on Auditory Events

TECHNICAL FIELD 
The invention relates to audio signals. More particularly, the invention relates
to characterizing audio signals and using characterizations to time align or
synchronize audio signals wherein one signal has been derived from the other or in
which both have been derived from the same other signal. Such synchronization is
useful, for example, in restoring television audio to video synchronization (lip-sync)
and in detecting a watermark embedded in an audio signal (the watermarked signal is
compared to an unwatermarked version of the signal). The invention may be
implemented so that a low processing power process brings two such audio signals
into substantial temporal alignment.

BACKGROUND ART 

The division of sounds into units perceived as separate is sometimes referred
to as "auditory event analysis" or "auditory scene analysis" ("ASA"). An extensive
discussion of auditory scene analysis is set forth by Albert S. Bregman in his book
Auditory Scene Analysis - The Perceptual Organization of Sounds, Massachusetts
Institute of Technology, 1991, Fourth printing, 2001, Second MIT Press paperback
edition. In addition, United States Patent 6,002,776 to Bhadkamkar, et al, December
14, 1999 cites publications dating back to 1976 as "prior art work related to sound
separation by auditory scene analysis." However, the Bhadkamkar, et al patent
discourages the practical use of auditory scene analysis, concluding that
"[t]echniques involving auditory scene analysis, although interesting from a scientific
point of view as models of human auditory processing, are currently far too
computationally demanding and specialized to be considered practical techniques for
sound separation until fundamental progress is made."

Bregman notes in one passage that "[w]e hear discrete units when the sound
changes abruptly in timbre, pitch, loudness, or (to a lesser extent) location in space."
(Auditory Scene Analysis - The Perceptual Organization of Sounds supra at page
469). Bregman also discusses the perception of multiple simultaneous sound streams
when, for example, they are separated in frequency.

There are many different methods for extracting characteristics or features
from audio. Provided the features or characteristics are suitably defined, their
extraction can be performed using automated processes. For example, "ISO/IEC JTC
1/SC 29/WG 11" (MPEG) is currently standardizing a variety of audio descriptors as
part of the MPEG-7 standard. A common shortcoming of such methods is that they
ignore ASA. Such methods seek to measure, periodically, certain "classical" signal
processing parameters such as pitch, amplitude, power, harmonic structure and
spectral flatness. Such parameters, while providing useful information, do not
analyze and characterize audio signals into elements perceived as separate according
to human cognition.

Auditory scene analysis attempts to characterize audio signals in a manner
similar to human perception by identifying elements that are separate according to
human cognition. By developing such methods, one can implement automated
processes that accurately perform tasks that heretofore would have required human
assistance.

The identification of separately perceived elements would allow the unique
identification of an audio signal using substantially less information than the full
signal itself. Compact and unique identifications based on auditory events may be
employed, for example, to identify a signal that is copied from another signal (or is
copied from the same original signal as another signal).

DISCLOSURE OF THE INVENTION

A method is described that generates a unique reduced-information
characterization of an audio signal that may be used to identify the audio signal. The
characterization may be considered a "signature" or "fingerprint" of the audio signal.
According to the present invention, an auditory scene analysis (ASA) is performed to
identify auditory events as the basis for characterizing an audio signal. Ideally, the
auditory scene analysis identifies auditory events that are most likely to be perceived
by a human listener even after the audio has undergone processing, such as low bit
rate coding or acoustic transmission through a loudspeaker. The audio signal may be
characterized by the boundary locations of auditory events and, optionally, by the
dominant frequency subband of each auditory event. The resulting information
pattern constitutes a compact audio fingerprint or signature that may be compared to
the fingerprint or signature of a related audio signal to determine quickly and/or with
low processing power the time offset between the original audio signals. The
reduced-information characteristics have substantially the same relative timing as the
audio signals they represent.

The auditory scene analysis method according to the present invention
provides a fast and accurate method of time aligning two audio signals, particularly
music, by comparing signatures containing auditory event information. ASA extracts
information underlying the perception of similarity, in contrast to traditional methods
that extract features less fundamental to perceiving similarities between audio signals
(such as pitch, amplitude, power, and harmonic structure). The use of ASA improves
the chance of finding similarity in, and hence time aligning, material that has
undergone significant processing, such as low bit rate coding or acoustic transmission
through a loudspeaker.

In the embodiments discussed below, it is assumed that the two audio signals
under discussion are derived from a common source. The method of the present
invention determines the time offset of one such audio signal with respect to the
other so that they may be brought into approximate synchronism with respect to each
other.

Although in principle the invention may be practiced either in the analog or
digital domain (or some combination of the two), in practical embodiments of the
invention, audio signals are represented by samples in blocks of data and processing
is done in the digital domain.

Referring to FIG. 1A, auditory scene analysis 2 is applied to an audio signal in
order to produce a "signature" or "fingerprint" related to that signal. In this case,
there are two audio signals of interest. They are similar in that one is derived from
the other or both have been previously derived from the same original signal. Thus,
auditory scene analysis is applied to both signals. For simplicity, FIG. 1A shows
only the application of ASA to one signal. As shown in FIG. 1B, the signatures for
the two audio signals, Signature 1 and Signature 2, are applied to a time offset
calculation function 4 that calculates an "offset" output that is a measure of the
relative time offset between the two signatures.

Because the signatures are representative of the audio signals but are
substantially shorter (i.e., they are more compact or have fewer bits) than the audio
signals from which they were derived, the time offset between the signatures can be
determined much faster than it would take to determine the time offset between the
audio signals. Moreover, because the signatures retain substantially the same relative
timing relationship as the audio signals from which they are derived, a calculation of
the offset between the signatures is usable to time align the original audio signals.
Thus, the offset output of function 4 is applied to a time alignment function 6. The
time alignment function also receives the two audio signals, Audio signal 1 and
Audio signal 2 (from which Signature 1 and 2 were derived), and provides two audio
signal outputs, Audio signal 3 and Audio signal 4. It is desired to adjust the relative
timing of Audio signal 1 with respect to Audio signal 2 so that they are in time
alignment (synchronism) or are nearly in time alignment. To accomplish this, one
may be time shifted with respect to the other or, in principle, both may be time
shifted. In practice, one of the audio signals is a "pass through" of Audio signal 1 or
Audio signal 2 (i.e., it is substantially the same signal) and the other is a time shifted
version of the other audio signal that has been temporally modified so that Audio
signal 3 and Audio signal 4 are in time synchronism or nearly in time synchronism
with each other, depending on the resolution accuracy of the offset calculation and
time alignment functions. If greater alignment accuracy is desired, further processing
may be applied to Audio signal 3 and/or Audio signal 4 by one or more other
processes that form no part of the present invention.
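
By way of illustration only, the pass-through and time-shift behavior of time
alignment function 6 may be sketched as follows. The function name, the use of
a sample-count offset, and the sign convention (a positive offset meaning that
Audio signal 2 lags Audio signal 1) are assumptions made for this sketch and
are not specified in the text above.

```python
def time_align(signal_1, signal_2, offset):
    """Return (signal_3, signal_4) for a known offset in samples.

    Assumed convention (hypothetical, for illustration): a positive offset
    means signal_2 lags signal_1 by `offset` samples; a negative offset
    means signal_2 leads signal_1.
    """
    # Audio signal 3 is a pass-through of Audio signal 1.
    signal_3 = list(signal_1)
    if offset >= 0:
        # signal_2 is late: discard its first `offset` samples.
        signal_4 = list(signal_2[offset:])
    else:
        # signal_2 is early: delay it by padding with silence.
        signal_4 = [0.0] * (-offset) + list(signal_2)
    return signal_3, signal_4
```

For example, if an impulse occurs at sample 2 of one signal and at sample 4 of
the other, calling the function with an offset of 2 places both impulses at
sample 2 of the outputs.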

The time alignment of the signals may be useful, for example, in restoring
television audio to video synchronization (lip-sync) and in detecting a watermark
embedded in an audio signal. In the former case, a signature of the audio is
embedded in the video signal prior to transmission or storage that may result in the
audio and video getting out of synchronism. At a reproduction point, a signature
may be derived from the audio signal and compared to the signature embedded in the
video signal in order to restore their synchronism. Systems of that type not
employing characterizations based on auditory scene analysis are described in U.S.
Patents Re 33,535, 5,202,761, 6,211,919, and 6,246,439, all of which are
incorporated herein by reference in their entireties. In the second case, an original
version of an audio signal is compared to a watermarked version of the audio signal
in order to recover the watermark. Such recovery requires close temporal alignment
of the two audio signals. This may be achieved, at least to a first degree of alignment,
by deriving a signature of each audio signal to aid in time alignment of the original
audio signals, as explained herein. Further details of FIGS. 1A and 1B are set forth
below.

For some applications, the processes of FIGS. 1A and 1B should be real-time.
For other applications, they need not be real-time. In a real-time application, the
process stores a history (a few seconds, for example) of the auditory scene analysis
for each input signal. Periodically, that event history is employed to update the offset
calculation in order to continually correct the time offset. The auditory scene
analysis information for each of the input signals may be generated in real time, or
the information for either of the signals may already be present (assuming that some
offline auditory scene analysis processing has already been performed). One use for
a real-time system is, for example, an audio/video aligner as mentioned above. One
series of event boundaries is derived from the audio; the other series of event
boundaries is recovered from the video (assuming some previous embedding of the
audio event boundaries into the video). The two event boundary sequences can be
periodically compared to determine the time offset between the audio and video in
order to improve the lip sync, for example.

Thus, both signatures may be generated from the audio signals at nearly the
same time that the time offset of the signatures is calculated and used to modify the
alignment of the audio signals to achieve their substantial coincidence. Alternatively,
one of the signatures to be compared may be carried along with the audio signal from
which it was derived, for example, by embedding the signature in another signal,
such as a video signal as in the case of audio and video alignment as just described.
As a further alternative, both signatures may be generated in advance and only the
comparison and timing modification performed in real time. For example, in the case
of two sources of the same television program (with both video and audio), both with
embedded audio signatures, the respective television signals (with accompanying
audio) could be synchronized (both video and audio) by comparing the recovered
signatures. The relative timing relationship of the video and audio in each television
signal would remain unaltered. The television signal synchronization would occur in
real time, but neither signature would be generated at that time nor simultaneously
with each other.

In accordance with aspects of the present invention, a computationally
efficient process for dividing audio into temporal segments or "auditory events" that
tend to be perceived as separate is provided.

A powerful indicator of the beginning or end of a perceived auditory event is
believed to be a change in spectral content. In order to detect changes in timbre and
pitch (spectral content) and, as an ancillary result, certain changes in amplitude, the
audio event detection process according to an aspect of the present invention detects
changes in spectral composition with respect to time. Optionally, according to a
further aspect of the present invention, the process may also detect changes in
amplitude with respect to time that would not be detected by detecting changes in
spectral composition with respect to time.

In its least computationally demanding implementation, the process divides
audio into time segments by analyzing the entire frequency band of the audio signal
(full bandwidth audio) or substantially the entire frequency band (in practical
implementations, band limiting filtering at the ends of the spectrum is often
employed) and giving the greatest weight to the loudest audio signal components.
This approach takes advantage of a psychoacoustic phenomenon in which at smaller
time scales (20 milliseconds (msec) and less) the ear may tend to focus on a single
auditory event at a given time. This implies that while multiple events may be
occurring at the same time, one component tends to be perceptually most prominent
and may be processed individually as though it were the only event taking place.
Taking advantage of this effect also allows the auditory event detection to scale with
the complexity of the audio being processed. For example, if the input audio signal
being processed is a solo instrument, the audio events that are identified will likely be
the individual notes being played. Similarly, for an input voice signal, the individual
components of speech, the vowels and consonants for example, will likely be
identified as individual audio elements. As the complexity of the audio increases,
such as music with a drumbeat or multiple instruments and voice, the auditory event
detection identifies the most prominent (i.e., the loudest) audio element at any given
moment. Alternatively, the "most prominent" audio element may be determined by
taking hearing threshold and frequency response into consideration.

Optionally, according to further aspects of the present invention, at the
expense of greater computational complexity, the process may also take into
consideration changes in spectral composition with respect to time in discrete
frequency bands (fixed or dynamically determined or both fixed and dynamically
determined bands) rather than the full bandwidth. This alternative approach would
take into account more than one audio stream in different frequency bands rather than
assuming that only a single stream is perceptible at a particular time.

Even a simple and computationally efficient process according to an aspect of 
the present invention for segmenting audio has been found useful to identify auditory 
events. 

An auditory event detecting process of the present invention may be
implemented by dividing a time domain audio waveform into time intervals or blocks
and then converting the data in each block to the frequency domain, using either a
filter bank or a time-frequency transformation, such as a Discrete Fourier Transform
(DFT) (implemented as a Fast Fourier Transform (FFT) for speed). The amplitude of
the spectral content of each block may be normalized in order to eliminate or reduce
the effect of amplitude changes. Each resulting frequency domain representation
provides an indication of the spectral content (amplitude as a function of frequency)
of the audio in the particular block. The spectral content of successive blocks is
compared and each change greater than a threshold may be taken to indicate the
temporal start or temporal end of an auditory event.

In order to minimize the computational complexity, only a single band of
frequencies of the time domain audio waveform may be processed, preferably either
the entire frequency band of the spectrum (which may be about 50 Hz to 15 kHz in
the case of an average quality music system) or substantially the entire frequency
band (for example, a band defining filter may exclude the high and low frequency
extremes).

Preferably, the frequency domain data is normalized, as is described below.
The degree to which the frequency domain data needs to be normalized gives an
indication of amplitude. Hence, if a change in this degree exceeds a predetermined
threshold, that too may be taken to indicate an event boundary. Event start and end
points resulting from spectral changes and from amplitude changes may be ORed
together so that event boundaries resulting from either type of change are identified.
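
As a minimal sketch of the ORing just described, assume that each block carries
one flag for a spectral-change boundary and that the degree of normalization is
available per block as a gain in dB. The function names, the dB representation,
and the 6 dB threshold used in the example are hypothetical and are not taken
from the described embodiment.

```python
def amplitude_change_flags(norm_gain_db, threshold_db):
    """Flag block q when the normalization gain changes by more than
    threshold_db relative to block q-1 (the first block is never flagged)."""
    flags = [0]
    for prev, cur in zip(norm_gain_db, norm_gain_db[1:]):
        flags.append(1 if abs(cur - prev) > threshold_db else 0)
    return flags


def combine_boundaries(spectral_flags, amplitude_flags):
    """OR two per-block flag lists so that either kind of change
    marks an event boundary."""
    return [1 if (s or a) else 0 for s, a in zip(spectral_flags, amplitude_flags)]
```

With normalization gains of 0, -1, -20 and -19 dB and a 6 dB threshold, only
the third block is flagged for an amplitude change; ORing that result with the
spectral flags 0, 1, 0, 0 yields boundaries at the second and third blocks.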

In practical embodiments in which the audio is represented by samples divided
into blocks, each auditory event temporal start and stop point boundary necessarily
coincides with a boundary of the block into which the time domain audio waveform
is divided. There is a trade-off between real-time processing requirements (as larger
blocks require less processing overhead) and resolution of event location (smaller
blocks provide more detailed information on the location of auditory events).

As a further option, as suggested above, but at the expense of greater
computational complexity, instead of processing the spectral content of the time
domain waveform in a single band of frequencies, the spectrum of the time domain
waveform prior to frequency domain conversion may be divided into two or more
frequency bands. Each of the frequency bands may then be converted to the
frequency domain and processed as though it were an independent channel. The
resulting event boundaries may then be ORed together to define the event boundaries
for that channel. The multiple frequency bands may be fixed, adaptive, or a
combination of fixed and adaptive. Tracking filter techniques employed in audio
noise reduction and other arts, for example, may be employed to define adaptive
frequency bands (e.g., dominant simultaneous sine waves at 800 Hz and 2 kHz could
result in two adaptively-determined bands centered on those two frequencies).

Other techniques for providing auditory scene analysis may be employed to
identify auditory events in the present invention.

DESCRIPTION OF THE DRAWINGS

FIG. 1A is a flow chart showing the process of extraction of a signature from
an audio signal in accordance with the present invention. The audio signal may, for
example, represent music (e.g., a musical composition or "song").

FIG. 1B is a flow chart illustrating a process for the time alignment of two
audio signals in accordance with the present invention.

FIG. 2 is a flow chart showing the process of extraction of audio event
locations and the optional extraction of dominant subbands from an audio signal in
accordance with the present invention.

FIG. 3 is a conceptual schematic representation depicting the step of spectral
analysis in accordance with the present invention.

FIGS. 4A and 4B are idealized audio waveforms showing a plurality of
auditory event locations and auditory event boundaries in accordance with the
present invention.

BEST MODE FOR CARRYING OUT THE INVENTION 

In a practical embodiment of the invention, the audio signal is represented by
samples that are processed in blocks of 512 samples, which corresponds to about
11.6 msec of input audio at a sampling rate of 44.1 kHz. A block length having a
time less than the duration of the shortest perceivable auditory event (about 20 msec)
is desirable. It will be understood that the aspects of the invention are not limited to
such a practical embodiment. The principles of the invention do not require
arranging the audio into sample blocks prior to determining auditory events, nor, if
they are, of providing blocks of constant length. However, to minimize complexity,
a fixed block length of 512 samples (or some other power of two number of samples)
is useful for three primary reasons. First, it provides low enough latency to be
acceptable for real-time processing applications. Second, it is a power-of-two
number of samples, which is useful for fast Fourier transform (FFT) analysis. Third,
it provides a suitably large window size to perform useful auditory scene analysis.

In the following discussions, the input signals are assumed to be data with
amplitude values in the range [-1,+1].

Auditory Scene Analysis 2 (FIG. 1A)
Following audio input data blocking (not shown), the input audio signal is
divided into auditory events, each of which tends to be perceived as separate, in
process 2 ("Auditory Scene Analysis") of FIG. 1A. Auditory scene analysis may be
accomplished by an auditory scene analysis (ASA) process discussed above.
Although one suitable process for performing auditory scene analysis is described in
further detail below, the invention contemplates that other useful techniques for
performing ASA may be employed.

FIG. 2 outlines a process in accordance with techniques of the present
invention that may be used as the auditory scene analysis process of FIG. 1A. The
ASA step or process 2 is composed of three general processing substeps. The first
substep 2-1 ("Perform Spectral Analysis") takes the audio signal, divides it into
blocks and calculates a spectral profile or spectral content for each of the blocks.
Spectral analysis transforms the audio signal into the short-term frequency domain.
This can be performed using any filterbank, either based on transforms or banks of
band-pass filters, and in either linear or warped frequency space (such as the Bark
scale or critical band, which better approximate the characteristics of the human ear).
With any filterbank there exists a tradeoff between time and frequency. Greater time
resolution, and hence shorter time intervals, leads to lower frequency resolution.
Greater frequency resolution, and hence narrower subbands, leads to longer time
intervals.

The first substep calculates the spectral content of successive time segments of
the audio signal. In a practical embodiment, described below, the ASA block size is
512 samples of the input audio signal (FIG. 3). In the second substep 2-2, the
differences in spectral content from block to block are determined ("Perform spectral
profile difference measurements"). Thus, the second substep calculates the
difference in spectral content between successive time segments of the audio signal.
In the third substep 2-3 ("Identify location of auditory event boundaries"), when the
spectral difference between one spectral-profile block and the next is greater than a
threshold, the block boundary is taken to be an auditory event boundary. Thus, the
third substep sets an auditory event boundary between successive time segments
when the difference in the spectral profile content between such successive time
segments exceeds a threshold. As discussed above, a powerful indicator of the
beginning or end of a perceived auditory event is believed to be a change in spectral
content. The locations of event boundaries are stored as a signature. An optional
process step 2-4 ("Identify dominant subband") uses the spectral analysis to identify
a dominant frequency subband that may also be stored as part of the signature.

In this embodiment, auditory event boundaries define auditory events having a
length that is an integral multiple of spectral profile blocks with a minimum length of
one spectral profile block (512 samples in this example). In principle, event
boundaries need not be so limited.

Either overlapping or non-overlapping segments of the audio may be
windowed and used to compute spectral profiles of the input audio. Overlap results
in finer resolution as to the location of auditory events and, also, makes it less likely
to miss an event, such as a transient. However, as time resolution increases,
frequency resolution decreases. Overlap also increases computational complexity.
Thus, overlap may be omitted. FIG. 3 shows a conceptual representation of
non-overlapping 512 sample blocks being windowed and transformed into the frequency
domain by the Discrete Fourier Transform (DFT). Each block may be windowed and
transformed into the frequency domain, such as by using the DFT, preferably
implemented as a Fast Fourier Transform (FFT) for speed.

The following variables may be used to compute the spectral profile of the
input block:

N = number of samples in the input signal
M = number of windowed samples used to compute spectral profile
P = number of samples of spectral computation overlap
Q = number of spectral windows/regions computed

In general, any integer numbers may be used for the variables above.
However, the implementation will be more efficient if M is set equal to a power of 2
so that standard FFTs may be used for the spectral profile calculations. In a practical
embodiment of the auditory scene analysis process, the parameters listed may be set
to:

M = 512 samples (or 11.6 msec at 44.1 kHz)
P = 0 samples (no overlap)

The above-listed values were determined experimentally and were found
generally to identify with sufficient accuracy the location and duration of auditory
events. However, setting the value of P to 256 samples (50% overlap) has been
found to be useful in identifying some hard-to-find events. While many different
types of windows may be used to minimize spectral artifacts due to windowing, the
window used in the spectral profile calculations is an M-point Hanning, Kaiser-Bessel
or other suitable, preferably non-rectangular, window. The above-indicated
values and a Hanning window type were selected after extensive experimental
analysis as they have been shown to provide excellent results across a wide range of
audio material. Non-rectangular windowing is preferred for the processing of audio
signals with predominantly low frequency content. Rectangular windowing produces
spectral artifacts that may cause incorrect detection of events. Unlike certain codec
applications where an overall overlap/add process must provide a constant level, such
a constraint does not apply here and the window may be chosen for characteristics
such as its time/frequency resolution and stop-band rejection.
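
The blocking and windowing described above may be sketched as follows, using
M and P as defined in the variable list. The function names, the symmetric form
of the Hanning window, and the choice to discard a trailing partial block are
assumptions of this sketch rather than details given in the text.

```python
import math


def hanning_window(m):
    """Symmetric M-point Hanning window: w[n] = 0.5 - 0.5*cos(2*pi*n/(M-1))."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (m - 1)) for n in range(m)]


def windowed_blocks(samples, m=512, p=0):
    """Split `samples` into windowed blocks of M samples with P samples
    of overlap between successive blocks (hop size M - P).

    Assumption: a trailing partial block is discarded, so the number of
    blocks is Q = floor((N - M) / (M - P)) + 1 when N >= M, else 0.
    """
    hop = m - p
    window = hanning_window(m)
    blocks = []
    for start in range(0, len(samples) - m + 1, hop):
        blocks.append([samples[start + n] * window[n] for n in range(m)])
    return blocks
```

Under these assumptions, 2048 input samples yield Q = 4 blocks with P = 0 and
Q = 7 blocks with P = 256 (50% overlap).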

In substep 2-1 (FIG. 2), the spectrum of each M-sample block may be
computed by windowing the data by an M-point Hanning, Kaiser-Bessel or other
suitable window, converting to the frequency domain using an M-point Fast Fourier
Transform, and calculating the magnitude of the FFT coefficients. The resultant data
is normalized so that the largest magnitude is set to unity, and the normalized array
of M numbers is converted to the log domain. The array need not be converted to the
log domain, but the conversion simplifies the calculation of the difference measure in
substep 2-2. Furthermore, the log domain more closely matches the log domain
amplitude nature of the human auditory system. The resulting log domain values
have a range of minus infinity to zero. In a practical embodiment, a lower limit can
be imposed on the range of values; the limit may be fixed, for example -60 dB, or be
frequency-dependent to reflect the lower audibility of quiet sounds at low and very
high frequencies. (Note that it would be possible to reduce the size of the array to
M/2 in that the FFT represents negative as well as positive frequencies).
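
A minimal sketch of substep 2-1 for one already-windowed block is given below.
The recursive radix-2 FFT is a textbook routine included only to keep the
example self-contained; it requires a power-of-two block length. The function
names and the handling of an all-zero block (every coefficient set to the
floor) are assumptions of this sketch.

```python
import cmath
import math


def fft(x):
    """Textbook recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def log_spectral_profile(windowed_block, floor_db=-60.0):
    """Magnitude spectrum normalized to a peak of unity, expressed in dB
    and limited below by floor_db (fixed -60 dB, as in the example above)."""
    mags = [abs(c) for c in fft(windowed_block)]
    peak = max(mags)
    profile = []
    for mag in mags:
        if peak == 0.0 or mag == 0.0:
            # log of zero is minus infinity; clamp to the lower limit.
            profile.append(floor_db)
        else:
            profile.append(max(20.0 * math.log10(mag / peak), floor_db))
    return profile
```

The returned list has M entries in the range floor_db to 0 dB; all M
coefficients are kept here, although only M/2 are independent for real input.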

Substep 2-2 calculates a measure of the difference between the spectra of
adjacent blocks. For each block, each of the M (log) spectral coefficients from
substep 2-1 is subtracted from the corresponding coefficient for the preceding block,
and the magnitude of the difference calculated (the sign is ignored). These M
differences are then summed to one number. Hence, for the whole audio signal, the
result is an array of Q positive numbers; the greater the number the more a block
differs in spectrum from the preceding block. This difference measure could also be
expressed as an average difference per spectral coefficient by dividing the difference
measure by the number of spectral coefficients used in the sum (in this case M
coefficients).
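
The difference measure of substep 2-2 may be sketched as a sum of absolute
differences between successive log spectral profiles. The function name is
hypothetical, and assigning a difference of zero to the first block, which has
no preceding block, is an assumption made so that the result has Q entries.

```python
def spectral_differences(profiles):
    """Return one difference measure per block: the sum over all
    coefficients of |current - preceding| (sign ignored).

    Assumption: the first block has no predecessor and is given 0.0.
    """
    diffs = [0.0]
    for prev, cur in zip(profiles, profiles[1:]):
        diffs.append(sum(abs(c - p) for c, p in zip(cur, prev)))
    return diffs
```

Dividing each entry by the number of coefficients would give the average
difference per spectral coefficient mentioned above.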

Substep 2-3 identifies the locations of auditory event boundaries by applying a
threshold to the array of difference measures from substep 2-2.
When a difference measure exceeds a threshold, the change in spectrum is deemed
sufficient to signal a new event and the block number of the change is recorded as an
event boundary. For the values of M and P given above and for log domain values
(in substep 2-1) expressed in units of dB, the threshold may be set equal to 2500 if
the whole magnitude FFT (including the mirrored part) is compared or 1250 if half
the FFT is compared (as noted above, the FFT represents negative as well as positive
frequencies; for the magnitude of the FFT, one is the mirror image of the other).
This value was chosen experimentally and it provides good auditory event boundary
detection. This parameter value may be changed to reduce (increase the threshold) or
increase (decrease the threshold) the detection of events.
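
The thresholding of substep 2-3 reduces to a comparison per block. In the
sketch below the default of 2500 is the value given above for the whole
magnitude FFT with M = 512, P = 0 and dB units; the function name and the
representation of the result as a list of 0/1 flags are assumptions.

```python
def event_boundaries(differences, threshold=2500.0):
    """Return one flag per block: 1 where the difference measure
    exceeds the threshold (a new event starts at that block), else 0."""
    return [1 if d > threshold else 0 for d in differences]
```

Passing threshold=1250.0 would correspond to comparing only half of the FFT
coefficients; raising the threshold reduces the number of detected events.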

The details of this practical embodiment are not critical. Other ways to
calculate the spectral content of successive time segments of the audio signal,
calculate the differences between successive time segments, and set auditory event
boundaries at the respective boundaries between successive time segments when the
difference in the spectral profile content between such successive time segments
exceeds a threshold may be employed.

For an audio signal consisting of Q blocks (of size M samples), the output of
the auditory scene analysis process of function 2 of FIG. 1A is an array B(q) of
information representing the location of auditory event boundaries where
q = 0, 1, . . . , Q-1. For a block size of M = 512 samples, overlap of P = 0
samples and a signal-sampling rate of 44.1 kHz, the auditory scene analysis
function 2 outputs approximately 86 values a second. Preferably, the array B(q)
is stored as the signature, such that, in its basic form, without the optional
dominant subband frequency information, the audio signal's signature is an array
B(q) representing a string of auditory event boundaries.
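
The figure of approximately 86 values a second follows from dividing the
sampling rate by the block advance M - P. The short sketch below reproduces
that arithmetic; the helper names are hypothetical, and the assumption that a
trailing partial block is discarded is made here only for illustration.

```python
def signature_rate(sample_rate_hz, block_size, overlap):
    """Signature values produced per second of audio."""
    return sample_rate_hz / (block_size - overlap)


def signature_length(num_samples, block_size, overlap):
    """Number of whole blocks Q that fit in num_samples samples.

    Assumes a trailing partial block is dropped (illustrative choice).
    """
    hop = block_size - overlap
    if num_samples < block_size:
        return 0
    return 1 + (num_samples - block_size) // hop


print(signature_rate(44100, 512, 0))    # values per second
print(signature_length(44100, 512, 0))  # blocks in one second of audio
```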

An example of the results of auditory scene analysis for two different signals is
shown in FIGS. 4A and 4B. The top plot, FIG. 4A, shows the results of auditory
scene processing where auditory event boundaries have been identified at samples
1024 and 1536. The bottom plot, FIG. 4B, shows the identification of event
boundaries at samples 1024, 2048 and 3072.

Identify dominant subband (optional) 
For each block, an optional additional step in the ASA processing (shown in
FIG. 2) is to extract information from the audio signal denoting the dominant
frequency "subband" of the block (conversion of the data in each block to the
frequency domain results in information divided into frequency subbands). This
block-based information may be converted to auditory-event based information, so
that the dominant frequency subband is identified for every auditory event. This
information for every auditory event provides the correlation processing (described
below) with further information in addition to the auditory event boundary
information. The dominant (largest amplitude) subband may be chosen from a
plurality of subbands, three or four, for example, that are within the range or band of
frequencies where the human ear is most sensitive. Alternatively, other criteria may
be used to select the subbands.

The spectrum may be divided, for example, into three subbands. The preferred
frequency range of the subbands is:

Subband 1    301 Hz to 560 Hz
Subband 2    560 Hz to 1938 Hz
Subband 3    1938 Hz to 9948 Hz

To determine the dominant subband, the square of the magnitude spectrum (or
the power magnitude spectrum) is summed for each subband. This resulting sum for
each subband is calculated and the largest is chosen. The subbands may also be
weighted prior to selecting the largest. The weighting may take the form of dividing
the sum for each subband by the number of spectral values in the subband, or
alternatively may take the form of an addition or multiplication to emphasize the
importance of a band over another. This can be useful where some subbands have
more energy on average than other subbands but are less perceptually important.
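
The selection just described can be sketched as below. The function name
`dominant_subband`, the mapping of FFT bin k to the frequency k times the
sampling rate divided by the FFT size, and the use of only the non-negative
frequency bins are assumptions made for this illustration; the subband edges
are the preferred ranges listed above, and the optional weighting shown is the
division by the number of spectral values.

```python
SUBBANDS_HZ = [(301.0, 560.0), (560.0, 1938.0), (1938.0, 9948.0)]


def dominant_subband(power_spectrum, sample_rate=44100.0, fft_size=512,
                     normalize=False):
    """Return the 1-based index of the subband with the largest summed power.

    power_spectrum[k] is assumed to hold the squared magnitude of FFT bin k
    for the non-negative frequencies (k = 0 .. fft_size // 2).
    """
    bin_hz = sample_rate / fft_size
    sums = []
    for low, high in SUBBANDS_HZ:
        bins = [k for k in range(len(power_spectrum))
                if low <= k * bin_hz < high]
        total = sum(power_spectrum[k] for k in bins)
        if normalize and bins:
            # Optional weighting: divide by the number of spectral values.
            total /= len(bins)
        sums.append(total)
    return sums.index(max(sums)) + 1


# Illustrative spectrum: strong but narrow low band, weak but wide high band.
spectrum = [0.0] * 257
for k in (4, 5, 6):
    spectrum[k] = 10.0
for k in range(23, 116):
    spectrum[k] = 1.0

print(dominant_subband(spectrum), dominant_subband(spectrum, normalize=True))
```

In this made-up spectrum the wide high band wins on raw summed power, while
the per-bin weighting favours the narrow low band, which is the kind of
correction the weighting is meant to provide.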

Considering an audio signal consisting of Q blocks, the output of the dominant
subband processing is an array DS(q) of information representing the dominant
subband in each block (q = 0, 1, . . . , Q-1). Preferably, the array DS(q) is stored in
the signature along with the array B(q). Thus, with the optional dominant subband
information, the audio signal's signature is two arrays B(q) and DS(q), representing,
respectively, a string of auditory event boundaries and a dominant frequency subband
within each block. Thus, in an idealized example, the two arrays could have the
following values (for a case in which there are three possible dominant subbands).

1 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 (Event Boundaries)
1 1 2 2 2 2 1 1 1 3 3 3 3 3 3 1 1 (Dominant Subbands)

In most cases, the dominant subband remains the same within each auditory
event, as shown in this example, or has an average value if it is not uniform for all
blocks within the event. Thus, a dominant subband may be determined for each
auditory event and the array DS(q) may be modified to provide that the same
dominant subband is assigned to each block within an event.
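
One way to modify DS(q) so that every block of an event carries the same
subband is sketched below. It is an illustration only: the function name
`unify_subbands` is hypothetical, a boundary flag of 1 is assumed to mark the
first block of a new event, and the most frequent subband within the event is
used in place of an average, which is a substitution made here for simplicity.

```python
from collections import Counter


def unify_subbands(boundaries, subbands):
    """Assign one dominant subband to every block of each auditory event.

    boundaries[q] == 1 is taken to mean that a new event starts at block q.
    Both lists are assumed to have the same length.
    """
    result = list(subbands)
    starts = [q for q, flag in enumerate(boundaries) if flag == 1]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # blocks before the first boundary form an event
    ends = starts[1:] + [len(subbands)]
    for start, end in zip(starts, ends):
        if start == end:
            continue
        # Most frequent subband within the event (ties: first one seen).
        winner = Counter(subbands[start:end]).most_common(1)[0][0]
        result[start:end] = [winner] * (end - start)
    return result


B = [1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
DS_noisy = [1, 1, 2, 2, 3, 2, 1, 1, 1, 3, 3, 3, 3, 3, 3, 1, 1]
print(unify_subbands(B, DS_noisy))
```

Here the single stray value 3 inside the second event is replaced by 2, giving
the idealized dominant subband row shown in the example above.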

Time Offset Calculation

The output of the Signature Extraction (FIG. 1A) is one or more arrays of
auditory scene analysis information that are stored as a signature, as described above.
The Time Offset Calculation function (FIG. 1B) takes two signatures and calculates a
measure of their time offset. This is performed using known cross correlation
methods.

Let $S_1$ (length $Q_1$) be an array from Signature 1 and $S_2$ (length $Q_2$) an array
from Signature 2. First, calculate the cross-correlation array $R_{E_1E_2}$ (see, for
example, John G. Proakis, Dimitris G. Manolakis, Digital Signal Processing:
Principles, Algorithms, and Applications, Macmillan Publishing Company, 1992,
ISBN 0-02-396815-X).

$$R_{E_1E_2}(l) = \sum_{q=-\infty}^{\infty} S_1(q) \cdot S_2(q-l), \qquad l = 0, \pm 1, \pm 2, \ldots \qquad (1)$$

In a practical embodiment, the cross-correlation is performed using standard
FFT based techniques to reduce execution time.

Since both $S_1$ and $S_2$ are finite in length, the non-zero component of
$R_{E_1E_2}$ has a length of $Q_1 + Q_2 - 1$. The lag $l$ corresponding to the maximum
element in $R_{E_1E_2}$ represents the time offset of $S_2$ relative to $S_1$.

$$l_{peak} = l \ \text{for}\ \mathrm{MAX}\big(R_{E_1E_2}(l)\big) \qquad (2)$$



This offset has the same units as the signature arrays $S_1$ and $S_2$. In a practical
implementation, the elements of $S_1$ and $S_2$ have an update rate equivalent to the
audio block size used to generate the arrays minus the overlap of adjacent blocks:
that is, M - P = 512 - 0 = 512 samples. Therefore the offset has units of 512 audio
samples.
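
The peak search can be sketched as below, using the standard cross-correlation
sum R(l) = sum over q of s1[q] * s2[q - l] from the cited textbook. This is a
minimal sketch, not the practical embodiment: the function name
`cross_correlation_peak` is hypothetical, and a direct double loop is used for
readability where the text describes FFT based techniques.

```python
def cross_correlation_peak(s1, s2):
    """Return (l_peak, r_peak) for R(l) = sum_q s1[q] * s2[q - l].

    Only the len(s1) + len(s2) - 1 lags that can be non-zero are searched,
    from -(len(s2) - 1) to len(s1) - 1. The first maximum wins on ties.
    """
    best_lag, best_value = 0, float("-inf")
    for lag in range(-(len(s2) - 1), len(s1)):
        total = 0
        for q in range(len(s1)):
            j = q - lag
            if 0 <= j < len(s2):
                total += s1[q] * s2[j]
        if total > best_value:
            best_lag, best_value = lag, total
    return best_lag, best_value


M, P = 512, 0  # block size and overlap quoted in the text

s1 = [1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
s2 = [0, 0, 0] + s1  # the same boundary pattern, three blocks later

l_peak, r_peak = cross_correlation_peak(s1, s2)
offset_in_samples = l_peak * (M - P)
print(l_peak, offset_in_samples)
```

With the sum written this way, the three-block delay in the second array shows
up as a peak at lag -3. Whether a positive or a negative lag corresponds to the
second signal being late depends on the correlation convention in use, so it is
worth checking the sign on a known test pair before applying the offset.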

Time Alignment

The Time Alignment function 6 (FIG. 1B) uses the calculated offset to time
align the two audio signals. It takes as inputs Audio Signals 1 and 2 (used to
generate the two signatures) and offsets one in relation to the other such that they are
both more closely aligned in time. The two aligned signals are output as Audio
Signals 3 and 4. The amount of delay or offset applied is the product of the relative
signature delay $l_{peak}$ between signatures $S_2$ and $S_1$, and the resolution M-P, in
samples, of the signatures.

For applications where only the passage common to the two sources is of
interest (as in the case of watermark detection where unmarked and marked signals
are to be directly compared), the two sources may be truncated to retain only that
common passage.

For applications where no information is to be lost, one signal may be offset
by the insertion of leading samples. For example let $x_1(n)$ be the samples of Audio
Signal 1 with a length of $L_1$ samples, and $x_2(n)$ be the samples of Audio Signal 2
with a length of $L_2$ samples. Also $l_{peak}$ represents the offset of $S_2$ relative to
$S_1$ in units of M-P audio samples.

The sample offset $D_{21}$ of Audio Signal 2 relative to Audio Signal 1 is the
product of the signature offset $l_{peak}$ and M-P.

$$D_{21} = l_{peak} \cdot (M - P) \qquad (3)$$



If $D_{21}$ is zero, then both input signals are output unmodified as signals 3 and 4
(see FIG. 1B). If $D_{21}$ is positive then input signal $x_1(n)$ is modified by inserting
leading samples.

$$x_1'(m) = \begin{cases} 0 & 0 \le m < D_{21} \\ x_1(n) & 0 \le n < L_1,\ m = n + D_{21} \end{cases} \qquad (4)$$

Signals $x_1'(m)$ and $x_2(n)$ are output as Signals 3 and 4 (see FIG. 1B).
If $D_{21}$ is negative then input signal $x_2(n)$ is modified by inserting leading samples.

$$x_2'(m) = \begin{cases} 0 & 0 \le m < -D_{21} \\ x_2(n) & 0 \le n < L_2,\ m = n - D_{21} \end{cases} \qquad (5)$$
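
The three cases above (zero, positive and negative sample offset) can be
sketched as a single helper. The names `sample_offset` and
`align_with_leading_zeros` are hypothetical and the signals are represented as
plain Python lists purely for illustration; the padding follows the
leading-sample insertion described for the positive and negative cases.

```python
def sample_offset(signature_offset, block_size=512, overlap=0):
    """Convert an offset in signature elements to audio samples."""
    return signature_offset * (block_size - overlap)


def align_with_leading_zeros(x1, x2, d21):
    """Pad whichever signal needs it with |d21| leading zero samples.

    d21 > 0: leading zeros are inserted into x1.
    d21 < 0: leading zeros are inserted into x2.
    d21 == 0: both signals are returned unchanged.
    """
    if d21 > 0:
        return [0] * d21 + list(x1), list(x2)
    if d21 < 0:
        return list(x1), [0] * (-d21) + list(x2)
    return list(x1), list(x2)


signal_3, signal_4 = align_with_leading_zeros([1, 2, 3], [4, 5], 2)
print(signal_3, signal_4)
print(sample_offset(3))
```

No samples are discarded in any of the three cases, which matches the stated
aim of this variant that no information is lost.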



Computation Complexity and Accuracy

The computational power required to calculate the offset is proportional to the
lengths of the signature arrays, $Q_1$ and $Q_2$. Because the process described has some
offset error, the time alignment process of the present invention may be followed by
a conventional process having a finer resolution that works directly with the audio
signals, rather than signatures. For example such a process may take sections of the
aligned audio signals (slightly longer than the offset error to ensure some overlap)
and cross correlate the sections directly to determine the exact sample error or fine
offset.
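
A follow-up refinement of this kind might look like the sketch below, which
cross correlates two short audio sections directly over a limited range of
lags. It is an illustration under stated assumptions rather than a described
embodiment: the function name `fine_offset`, the lag convention (a positive
result means the second section lags the first) and the brute-force search are
all choices made here.

```python
def fine_offset(section_a, section_b, max_lag):
    """Return the lag d in [-max_lag, max_lag] maximizing sum_n a[n] * b[n + d].

    A positive d means section_b lags section_a by d samples.
    """
    best_lag, best_value = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for n, a in enumerate(section_a):
            m = n + lag
            if 0 <= m < len(section_b):
                total += a * section_b[m]
        if total > best_value:
            best_lag, best_value = lag, total
    return best_lag


# Made-up sample values; the second section is the first delayed by 5 samples.
a = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3, -5, 8]
b = [0] * 5 + a
print(fine_offset(a, b, max_lag=8))
```

In practice `max_lag` would be chosen somewhat larger than the residual error
left by the signature-based alignment, so that the true fine offset falls
inside the search range.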

Since the signature arrays are used to calculate the sample offset, the accuracy
of the time alignment method is limited to the audio block size used to generate the
signatures: in this implementation, 512 samples. In other words this method will
have error in the sample offset of approximately plus/minus half the block size: in
this implementation ±256 samples.

This error can be reduced by increasing the resolution of the signatures;
however there exists a tradeoff between accuracy and computational complexity.
Lower offset error requires finer resolution in the signature arrays (more array
elements) and this requires higher processing power in computing the cross
correlation. Higher offset error requires coarser resolution in the signature arrays
(fewer array elements) and this requires lower processing power in computing the
cross correlation.

Applications

Watermarking involves embedding information in a signal by altering the
signal in some predefined way, including the addition of other signals, to create a
marked signal. The detection or extraction of embedded information often relies on a
comparison of the marked signal with the original source. Also the marked signal
often undergoes other processing including audio coding and speaker/microphone
acoustic path transmission. The present invention provides a way of time aligning a
marked signal with the original source to then facilitate the extraction of embedded
information.

Subjective and objective methods for determining audio coder quality compare
a coded signal with the original source, used to generate the coded signal, in order to
create a measure of the signal degradation (for example an ITU-R 5 point impairment
score). The comparison relies on time alignment of the coded audio signal with the
original source signal. This method provides a means of time aligning the source and
coded signals.

Other applications of the invention are possible, for example, improving the
lip-syncing of audio and video signals, as mentioned above.

It should be understood that implementation of other variations and
modifications of the invention and its various aspects will be apparent to those skilled
in the art, and that the invention is not limited by these specific embodiments
described. It is therefore contemplated to cover by the present invention any and all
modifications, variations, or equivalents that fall within the true spirit and scope of
the basic underlying principles disclosed and claimed herein.

The present invention and its various aspects may be implemented as software
functions performed in digital signal processors, programmed general-purpose digital
computers, and/or special purpose digital computers. Interfaces between analog and
digital signal streams may be performed in appropriate hardware and/or as functions
in software and/or firmware.

CLAIMS

1. A method for time aligning audio signals, wherein one signal has been
derived from the other or both have been derived from another signal, comprising

deriving reduced-information characterizations of said audio signals, wherein
said reduced-information characterizations are based on auditory scene analysis,

calculating the time offset of one characterization with respect to the other
characterization,

modifying the temporal relationship of said audio signals with respect to each
other in response to said time offset such that said audio signals are substantially
coincident with each other.

2. The method of claim 1 wherein said reduced-information characterizations
are derived from said audio signals and embedded in respective other signals that are
carried with the audio signals from which they were derived prior to said calculating
and modifying.

3. The method of claim 2 wherein said other signals are the video portion of a
television signal and said audio signals are the audio portion of the respective
television signal.

4. A method for time aligning an audio signal and another signal, comprising

deriving a reduced-information characterization of the audio signal and
embedding said characterization in the other signal when the audio signal and other
signal are substantially in synchronism, wherein said characterization is based on
auditory scene analysis,

recovering the embedded characterization of said audio signal from said other
signal and deriving a reduced-information characterization of said audio signal from
said audio signal in the same way the embedded characterization of the audio signal
was derived based on auditory scene analysis, after said audio signal and said other
signal have been subjected to differential time offsets,

calculating the time offset of one characterization with respect to the other
characterization,

modifying the temporal relationship of the audio signal with respect to the
other signal in response to said time offset such that the audio signal and video signal
are substantially in synchronism with each other.

5. The method of claim 4 wherein said other signal is a video signal. 

6. The method of claim 1 or claim 4 wherein calculating the time offset
includes performing a cross-correlation of said characterizations.

7. The method of any one of claims 1-6 wherein said reduced-information
characterizations based on auditory scene analysis are arrays of information
representing at least the location of auditory event boundaries.

8. The method of claim 7 wherein said auditory event boundaries are
determined by

calculating the spectral content of successive time segments of said audio
signal,

calculating the difference in spectral content between successive time
segments of said audio signal, and

identifying an auditory event boundary as the boundary between successive
time segments when the difference in the spectral content between such successive
time segments exceeds a threshold.

9. The method of claim 7 or claim 8 wherein said arrays of information also
represent the dominant frequency subband of each of said auditory events.